Projections using Hypertuned model through XGboost
All data is from FanGraphs. I have no affiliation with FanGraphs, but please consider contributing to their website if you found this project informative.
This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection
The Categories used for prediction valuation are year-end rankings for the following metrics:
First we need to load the packages that R needs to run the analysis
library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling
library(Matrix)
library(Boruta)
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML)
library(kableExtra)
library(owmr)The # comments generally explain what additional functionality each library adds to R
All data is downloaded from Fan Graphs. From this location. The data is also available on my Github here. There are player level and team data sets
#data read-in
pitcher_data <- read_csv("FanGraphs Leaderboard_Pitching20IP.csv")#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>%
rename_with( ~ paste0("T_", .x))str give information about an object, while
skim provides a customizable summary
#Output not shown for space
#str(FDG_Team2)
skim(FDG_Team2) %>%
tibble::as_tibble()NA
NAskim let’s us see how the data was imported into R.
Documentation can be found here
#Full Dataset dimensions
skimr::skim(pitcher_data) %>%
tibble::as_tibble() %>%
select(skim_type,skim_variable,complete_rate) %>%
filter(complete_rate >0.30) #288 Variables
#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populatedAdditionally let’s look at how variables vary by year to see if there are any discrepancies there
#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
pitcher_data_dist =
pitcher_data %>%
group_by(Season) %>%
summarize (Max_Games = max(G),
Avg_W= mean(W)
)
pitcher_data_dist
ggplot(pitcher_data_dist, aes(Season, Avg_W)) +
geom_col()+
ggtitle("Average Wins by Year")+
theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))NA
NA
NAWhat are some issues with the data?
Many of Variables, such as K%, are being read in as characters
There is spotty data coverage in some of the variables (~Variables have less than 30% Coverage)
2020 Data only includes 60 games worth of data
Team Data needs to be appended to pitcher Data by Team Name
There are several ways to do this, we will identify the variables we
want to change that are mis-identified. parse_number can be
used to pull numbers from these variables. Additional ways to tackle
this can be found here
#Select Column names that are characters but not Team or Name, These should be percentages
pitcher_data_chars_to_convert <- pitcher_data %>%
select_if(is.character)%>% select(-Team,-Name) %>%
mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution
#We can exclude the variables we converted and reintroduce them
pitcher_data_num <- pitcher_data %>% select(-colnames(pitcher_data_chars_to_convert))
pitcher_data2 = cbind(pitcher_data_num,pitcher_data_chars_to_convert) %>%
select (colnames(pitcher_data)) %>% #preserve original order
dplyr::rename(flyball_perc = `FB%...50`,fastball_perc = `FB%...74`) #rename two ambiguous columns
skim(pitcher_data2) %>%
as_tibble() %>%
group_by(skim_type) %>%
count()
#Logical variables are R's best guess, in our case they are all NA's and will be removedThe same can be done for the Team Data that is loaded
#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>%
select_if(is.character)%>% select(-T_Team) %>%
mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using
#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))
FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>%
select (colnames(FDG_Team2)) %>% #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`)
skim(FDG_Team3) %>%
as_tibble() %>%
group_by(skim_type) %>%
count()NA
NA
NAI choose 30% coverage of data necessary but this can be adjusted up
or down. This will also get rid of columns that are all
NA.
# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(pitcher_data2) %>%
dplyr::select(skim_type, skim_variable, complete_rate) %>%
filter (complete_rate > 0.30)
#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep)
#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))
#Only keep the columns designated to have over 30% of their data populated or greater
pitcher_data3 = pitcher_data2 %>%
select(one_of(Player_cols_to_keep)) Repeat the process for Team Variables
Team_cols_to_keep =
skim(FDG_Team3) %>%
dplyr::select(skim_type, skim_variable, complete_rate) %>%
filter (complete_rate > 0.30)
#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep)
#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))
#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>%
select(one_of(Team_cols_to_keep)) Some Variables will need to be normalized by Innings_Pitched (IP) if they aren’t a percentage already. Remaining Variables are percentages or indices so will not need to be transformed
pitcher_data4 = pitcher_data3 %>%
mutate( #create new variables based on existing variables
W_IP = W/IP,
L_IP = L/IP,
ShO_IP = ShO/IP,
SV_IP = SV/IP,
BS_IP = BS/IP,
TBF_IP = TBF/IP,
H_IP = H/IP,
R_IP = R/IP,
ER_IP = ER/IP,
HR_IP=HR/IP,
BB_IP=BB/IP,
IBB_IP=IBB/IP,
HBP_IP=HBP/IP,
WP_IP= WP/IP,
BK_IP=BK/IP,
SO_IP=SO/IP,
GB_IP = GB/IP, #Groundballs
FB_IP = FB/IP, #FlyBalls
LD_IP = LD/IP, #LineDrives
IFFB_IP = IFFB/IP, #Infield Fly balls
Balls_IP= Balls/IP,
Strikes_IP= Strikes/IP,
Pitches_IP= Pitches/IP,
RS_IP= RS/IP,
IFH_IP= IFH/IP,
BU_IP= BU/IP,
BUH_IP= BUH/IP,
Pulls_IP= Pulls/IP,
HLD_IP= HLD/IP,
SD_IP= SD/IP,
MD_IP= MD/IP,
Barrels_IP= Barrels/IP,
HardHits_IP= HardHit/IP
) %>% select(-L,-G,-IP,-ShO,-BS,-(TBF:BK),-(GB:BUH),-Pulls,-(SD:MD),-Barrels,-HardHit,-Events)
#will be removed after lags -FIP,-(RAR:WPA),,-(wFB:wCH),-(`ERA-`:`xFIP-`),-SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-`wCH (pi)`:`wSL (pi)`,`K/9+`:`HR/FB%+`) #Drop the old variables
#Be careful about RS - Run Support and RS/9
#skim(pitcher_data4) %>% as_tibble()Repeat the process for Team Variables
FDG_Team5 = FDG_Team4 %>%
mutate( #create new variables based on existing variables
T_H_T_PA = T_H/T_PA,
T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
T_x2b_T_PA = T_2B/T_PA,
T_x3b_T_PA = T_3B/T_PA,
T_HR_T_PA = T_HR/T_PA,
T_R_T_PA = T_R/T_PA,
T_RBI_T_PA = T_RBI/T_PA,
T_BB_T_PA = T_BB/T_PA,
T_IBB_T_PA = T_IBB/T_PA,
T_SO_T_PA=T_SO/T_PA,
T_HBP_T_PA=T_HBP/T_PA,
T_SF_T_PA=T_SF/T_PA,
T_SH_T_PA=T_SH/T_PA,
T_GDP_T_PA= T_GDP/T_PA,#ground into double play
T_SB_T_PA=T_SB/T_PA,
T_CS_T_PA=T_CS/T_PA,
T_GB_T_PA = T_GB/T_PA, #Groundballs
T_FB_T_PA = T_FB/T_PA, #FlyBalls
T_LD_T_PA = T_LD/T_PA, #LineDrives
T_IFFB_T_PA = T_IFFB/T_PA, #Infield Fly balls
T_Pitches_T_PA= T_Pitches/T_PA,
T_Balls_T_PA= T_Balls/T_PA,
T_Strikes_T_PA= T_Strikes/T_PA,
T_IFH_T_PA= T_IFH/T_PA,
T_BU_T_PA= T_BU/T_PA,
T_BUH_T_PA= T_BUH/T_PA,
T_PH_T_PA= T_PH/T_PA,
T_Barrels_T_PA= T_Barrels/T_PA,
T_HardHits_T_PA= T_HardHit/T_PA
) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables
#skim(FDG_Team5) %>% as_tibble()There are several ways to lag a dataset BY
GROUP.
* Dplyr way is here..
* While data.table (the method used below) is here.
#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance
#Order the dataset by lag columns
pitcher_data5 = arrange(pitcher_data4, playerid,Season) #playerid is the Fangraph id assigned to each player
# Convert dataframe to data.table format
DT_pitcher = data.table(pitcher_data5)
#designate columns to lag - which is all of them
cols1 = colnames(pitcher_data5)
anscols = paste("lag", cols1, sep="_")
DT_pitcher[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year
pitcher_data6 = as.data.frame(DT_pitcher) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)
ncol(pitcher_data5) #250 - no lags[1] 251
ncol(pitcher_data6) #495 - lagged data ~ (250 * 2)-5[1] 497
We can use either the merge function or the SQL
functionality provided by the sqldf package to join the
lagged player level data to the Team level data
df_pitching_init = sqldf(
"
select a.*, b.*
from pitcher_data6 a
left join FDG_Team5 b
on a.Team = b.T_Team and a.Season = b.T_Season
"
) %>% select(-T_Team,-T_Season,-T_Age,-T_G,-T_AB)# Unncessary Team Variables
nrow(df_pitching_init) - nrow(pitcher_data6) #check if any rows are duplicated[1] 0
We can use Percentile based ranking to get rankings for players from the 2021 season.
Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. All data is already normalized by plate appearances, but must now be ranked for each year.
#Categories I include are:
#Wins, Saves, WHIP, ERA, SOs, Holds
df_pitching_init2 = df_pitching_init %>%
# arrange(player_id,year) %>%
group_by(Season) %>%
mutate(
Wins_share = order(order(rank(W_IP,ties.method = 'average'),decreasing = FALSE))/n(),
SO_share = order(order(rank(SO_IP,ties.method = 'average'),decreasing = FALSE))/n(),
SV_share = order(order(rank(SV_IP,ties.method = 'average'),decreasing = FALSE))/n(),
WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
HLD_share = 0,
Worth = Wins_share+SO_share+SV_share+WHIP_share+ERA_share+HLD_share
) %>%
ungroup() Chart of the Distribution of initial percentiles
As the chart below shows, the data is roughly normal.
skewness((df_pitching_init2$Worth))[1] 0.086
ggplot2::qplot(df_pitching_init2$Worth, main="Total Pitching Worth Dataset") + geom_histogram(colour="black", fill="grey") + theme_bw()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
min(df_pitching_init2$Worth)[1] 0.32
max(df_pitching_init2$Worth)[1] 4.6
ggpubr::ggqqplot(df_pitching_init2$Worth)
shapiro.test(df_pitching_init2$Worth)
Shapiro-Wilk normality test
data: df_pitching_init2$Worth
W = 1, p-value = 0.002
While it looks like many of the top players have low worth scores, it is because we haven’t applied a modifier for IP yet. Wins are harder to come by relative to any other stat and require more innings pitched.
df_pitching_init2_raw = df_pitching_init %>%
# arrange(player_id,year) %>%
group_by(Season) %>%
mutate(
Wins_share_raw = order(order(rank(W,ties.method = 'average'),decreasing = FALSE))/n(),
SO_share_raw = order(order(rank(SO,ties.method = 'average'),decreasing = FALSE))/n(),
SV_share_raw = order(order(rank(SV,ties.method = 'average'),decreasing = FALSE))/n(),
WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
HLD_share_raw = 0,
Worth = Wins_share_raw+SO_share_raw+SV_share_raw+WHIP_share+ERA_share+HLD_share_raw
) %>%
ungroup() %>%
select(-W,-SO,-SV,-WHIP,-ERA,-HLD)
options(digits=2)
df_pitching_init2021_raw =
df_pitching_init2_raw %>%
group_by(Name) %>%
filter(Season == 2021) %>%
arrange(desc(Worth)) %>%
select(Name,Wins_share_raw,SO_share_raw,SV_share_raw,WHIP_share,ERA_share,HLD_share_raw,Worth)
df_pitching_init2021_raw %>%
filter (Worth>2.9) %>%
kbl() %>%
kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)| Name | Wins_share_raw | SO_share_raw | SV_share_raw | WHIP_share | ERA_share | HLD_share_raw | Worth |
|---|---|---|---|---|---|---|---|
| Daniel Bard | 0.79 | 0.69 | 0.97 | 0.87 | 0.75 | 0 | 4.0 |
| Garrett Richards | 0.79 | 0.84 | 0.86 | 0.87 | 0.68 | 0 | 4.0 |
| Jesus Luzardo | 0.77 | 0.78 | 0.55 | 0.89 | 0.92 | 0 | 3.9 |
| Brady Singer | 0.72 | 0.89 | 0.66 | 0.82 | 0.69 | 0 | 3.8 |
| Brad Keller | 0.86 | 0.85 | 0.31 | 0.92 | 0.77 | 0 | 3.7 |
| Jose Alvarado | 0.82 | 0.60 | 0.90 | 0.87 | 0.51 | 0 | 3.7 |
| Mitch Keller | 0.69 | 0.76 | 0.40 | 0.96 | 0.87 | 0 | 3.7 |
| Justus Sheffield | 0.81 | 0.56 | 0.37 | 0.98 | 0.95 | 0 | 3.7 |
| Nick Pivetta | 0.89 | 0.95 | 0.72 | 0.51 | 0.60 | 0 | 3.7 |
| Rafael Montero | 0.65 | 0.34 | 0.91 | 0.81 | 0.91 | 0 | 3.6 |
| Josh Fleming | 0.93 | 0.58 | 0.78 | 0.59 | 0.73 | 0 | 3.6 |
| Alec Mills | 0.74 | 0.73 | 0.71 | 0.69 | 0.73 | 0 | 3.6 |
| Joe Jimenez | 0.75 | 0.49 | 0.73 | 0.80 | 0.85 | 0 | 3.6 |
| Paul Fry | 0.58 | 0.52 | 0.83 | 0.80 | 0.87 | 0 | 3.6 |
| Wil Crowe | 0.61 | 0.82 | 0.53 | 0.84 | 0.78 | 0 | 3.6 |
| Erick Fedde | 0.82 | 0.87 | 0.38 | 0.70 | 0.78 | 0 | 3.5 |
| Adam Ottavino | 0.77 | 0.62 | 0.93 | 0.70 | 0.51 | 0 | 3.5 |
| Bryan Garcia | 0.50 | 0.22 | 0.85 | 0.98 | 0.98 | 0 | 3.5 |
| Alex Reyes | 0.92 | 0.77 | 0.98 | 0.58 | 0.25 | 0 | 3.5 |
| Yusei Kikuchi | 0.83 | 0.94 | 0.60 | 0.54 | 0.57 | 0 | 3.5 |
| Zach Davies | 0.74 | 0.83 | 0.19 | 0.87 | 0.83 | 0 | 3.5 |
| Tarik Skubal | 0.87 | 0.94 | 0.64 | 0.45 | 0.55 | 0 | 3.5 |
| Vladimir Gutierrez | 0.90 | 0.74 | 0.49 | 0.66 | 0.66 | 0 | 3.4 |
| Brett de Geus | 0.52 | 0.34 | 0.65 | 0.95 | 0.98 | 0 | 3.4 |
| Andrew Heaney | 0.86 | 0.92 | 0.30 | 0.53 | 0.84 | 0 | 3.4 |
| Ryan Weathers | 0.63 | 0.63 | 0.79 | 0.63 | 0.76 | 0 | 3.4 |
| Patrick Corbin | 0.88 | 0.90 | 0.09 | 0.72 | 0.84 | 0 | 3.4 |
| Kris Bubic | 0.77 | 0.84 | 0.61 | 0.63 | 0.58 | 0 | 3.4 |
| Daniel Lynch | 0.63 | 0.47 | 0.62 | 0.90 | 0.81 | 0 | 3.4 |
| Eduardo Rodriguez | 0.97 | 0.96 | 0.19 | 0.64 | 0.66 | 0 | 3.4 |
| J.A. Happ | 0.91 | 0.86 | 0.07 | 0.73 | 0.83 | 0 | 3.4 |
| Chris Paddack | 0.83 | 0.79 | 0.56 | 0.45 | 0.73 | 0 | 3.4 |
| Cole Irvin | 0.93 | 0.87 | 0.48 | 0.55 | 0.52 | 0 | 3.4 |
| Griffin Canning | 0.71 | 0.56 | 0.54 | 0.75 | 0.80 | 0 | 3.4 |
| Matt Harvey | 0.73 | 0.77 | 0.15 | 0.81 | 0.88 | 0 | 3.3 |
| Keegan Akin | 0.37 | 0.70 | 0.50 | 0.85 | 0.93 | 0 | 3.3 |
| Dane Dunning | 0.70 | 0.83 | 0.50 | 0.70 | 0.59 | 0 | 3.3 |
| Chris Stratton | 0.80 | 0.73 | 0.93 | 0.50 | 0.37 | 0 | 3.3 |
| Cesar Valdez | 0.27 | 0.38 | 0.92 | 0.91 | 0.84 | 0 | 3.3 |
| Ben Bowden | 0.51 | 0.35 | 0.57 | 0.97 | 0.92 | 0 | 3.3 |
| Tanner Rainey | 0.19 | 0.35 | 0.87 | 0.93 | 0.97 | 0 | 3.3 |
| Matt Manning | 0.62 | 0.49 | 0.58 | 0.79 | 0.83 | 0 | 3.3 |
| Dallas Keuchel | 0.88 | 0.76 | 0.09 | 0.81 | 0.76 | 0 | 3.3 |
| Jeurys Familia | 0.88 | 0.62 | 0.68 | 0.66 | 0.45 | 0 | 3.3 |
| Kyle Finnegan | 0.67 | 0.59 | 0.94 | 0.74 | 0.34 | 0 | 3.3 |
| Jake Arrieta | 0.63 | 0.70 | 0.03 | 0.96 | 0.97 | 0 | 3.3 |
| Jorge Lopez | 0.45 | 0.83 | 0.25 | 0.89 | 0.86 | 0 | 3.3 |
| Aroldis Chapman | 0.73 | 0.77 | 0.99 | 0.52 | 0.27 | 0 | 3.3 |
| Tanner Scott | 0.69 | 0.61 | 0.40 | 0.84 | 0.74 | 0 | 3.3 |
| Dylan Cease | 0.98 | 0.99 | 0.45 | 0.42 | 0.44 | 0 | 3.3 |
| James Karinchak | 0.83 | 0.68 | 0.94 | 0.35 | 0.47 | 0 | 3.3 |
| Phil Maton | 0.75 | 0.72 | 0.42 | 0.72 | 0.65 | 0 | 3.3 |
| Jordan Lyles | 0.91 | 0.91 | 0.07 | 0.64 | 0.74 | 0 | 3.3 |
| Gregory Soto | 0.76 | 0.66 | 0.97 | 0.59 | 0.28 | 0 | 3.3 |
| Griffin Jax | 0.62 | 0.58 | 0.57 | 0.59 | 0.90 | 0 | 3.3 |
| Kyle Hendricks | 0.98 | 0.89 | 0.16 | 0.57 | 0.67 | 0 | 3.3 |
| Zach Plesac | 0.93 | 0.80 | 0.55 | 0.32 | 0.64 | 0 | 3.2 |
| Hansel Robles | 0.43 | 0.65 | 0.95 | 0.62 | 0.58 | 0 | 3.2 |
| Codi Heuer | 0.84 | 0.48 | 0.85 | 0.53 | 0.54 | 0 | 3.2 |
| Jon Gray | 0.86 | 0.93 | 0.28 | 0.55 | 0.62 | 0 | 3.2 |
| Matt Peacock | 0.71 | 0.43 | 0.58 | 0.83 | 0.69 | 0 | 3.2 |
| Carlos Estevez | 0.45 | 0.52 | 0.94 | 0.76 | 0.56 | 0 | 3.2 |
| JT Brubaker | 0.69 | 0.88 | 0.39 | 0.50 | 0.77 | 0 | 3.2 |
| German Marquez | 0.96 | 0.95 | 0.28 | 0.46 | 0.57 | 0 | 3.2 |
| Shane McClanahan | 0.93 | 0.90 | 0.62 | 0.46 | 0.31 | 0 | 3.2 |
| Logan Gilbert | 0.77 | 0.88 | 0.64 | 0.27 | 0.65 | 0 | 3.2 |
| Ryan Yarbrough | 0.89 | 0.84 | 0.35 | 0.38 | 0.73 | 0 | 3.2 |
| Rex Brothers | 0.43 | 0.65 | 0.69 | 0.68 | 0.75 | 0 | 3.2 |
| Adbert Alzolay | 0.69 | 0.88 | 0.75 | 0.25 | 0.62 | 0 | 3.2 |
| Kenta Maeda | 0.76 | 0.83 | 0.45 | 0.51 | 0.64 | 0 | 3.2 |
| Rafael Dolis | 0.26 | 0.31 | 0.86 | 0.95 | 0.81 | 0 | 3.2 |
| Amir Garrett | 0.04 | 0.53 | 0.92 | 0.83 | 0.86 | 0 | 3.2 |
| Bruce Zimmermann | 0.62 | 0.48 | 0.58 | 0.79 | 0.72 | 0 | 3.2 |
| Luis Castillo | 0.86 | 0.96 | 0.31 | 0.60 | 0.46 | 0 | 3.2 |
| Ryan Helsley | 0.76 | 0.40 | 0.75 | 0.67 | 0.61 | 0 | 3.2 |
| Trevor Stephan | 0.51 | 0.65 | 0.78 | 0.66 | 0.57 | 0 | 3.2 |
| Kyle Freeland | 0.81 | 0.82 | 0.33 | 0.67 | 0.55 | 0 | 3.2 |
| Luis Garcia | 0.95 | 0.95 | 0.66 | 0.30 | 0.32 | 0 | 3.2 |
| Mike Mayers | 0.68 | 0.74 | 0.83 | 0.49 | 0.43 | 0 | 3.2 |
| Eli Morgan | 0.71 | 0.69 | 0.57 | 0.43 | 0.76 | 0 | 3.2 |
| Sam Hentges | 0.20 | 0.60 | 0.45 | 0.96 | 0.93 | 0 | 3.1 |
| Ryan Hendrix | 0.71 | 0.26 | 0.51 | 0.82 | 0.85 | 0 | 3.1 |
| Daniel Norris | 0.30 | 0.51 | 0.71 | 0.76 | 0.87 | 0 | 3.1 |
| Trevor May | 0.78 | 0.70 | 0.88 | 0.43 | 0.35 | 0 | 3.1 |
| Drew Smyly | 0.94 | 0.84 | 0.15 | 0.61 | 0.59 | 0 | 3.1 |
| Alex Colome | 0.54 | 0.50 | 0.96 | 0.64 | 0.49 | 0 | 3.1 |
| Kwang-hyun Kim | 0.84 | 0.69 | 0.80 | 0.49 | 0.31 | 0 | 3.1 |
| Taylor Hearn | 0.76 | 0.76 | 0.43 | 0.53 | 0.63 | 0 | 3.1 |
| Brad Hand | 0.73 | 0.53 | 0.97 | 0.45 | 0.44 | 0 | 3.1 |
| Kyle Zimmer | 0.57 | 0.38 | 0.83 | 0.65 | 0.68 | 0 | 3.1 |
| Adam Plutko | 0.18 | 0.37 | 0.73 | 0.90 | 0.94 | 0 | 3.1 |
| Josiah Gray | 0.40 | 0.66 | 0.66 | 0.60 | 0.78 | 0 | 3.1 |
| Kyle Funkhouser | 0.83 | 0.57 | 0.77 | 0.65 | 0.30 | 0 | 3.1 |
| JC Mejia | 0.24 | 0.40 | 0.59 | 0.88 | 0.99 | 0 | 3.1 |
| Tylor Megill | 0.62 | 0.79 | 0.61 | 0.48 | 0.60 | 0 | 3.1 |
| Bryan Shaw | 0.73 | 0.62 | 0.81 | 0.62 | 0.32 | 0 | 3.1 |
| Josh Sborz | 0.60 | 0.61 | 0.75 | 0.67 | 0.45 | 0 | 3.1 |
| Tyler Mahle | 0.98 | 0.98 | 0.34 | 0.38 | 0.40 | 0 | 3.1 |
| Hyun-Jin Ryu | 0.99 | 0.90 | 0.25 | 0.37 | 0.56 | 0 | 3.1 |
| Austin Gomber | 0.89 | 0.83 | 0.35 | 0.39 | 0.60 | 0 | 3.1 |
| Max Kranick | 0.39 | 0.22 | 0.63 | 0.94 | 0.89 | 0 | 3.1 |
| Steven Matz | 0.98 | 0.91 | 0.21 | 0.55 | 0.42 | 0 | 3.1 |
| Luis Oviedo | 0.25 | 0.20 | 0.63 | 0.99 | 0.99 | 0 | 3.1 |
| Martin Perez | 0.78 | 0.77 | 0.07 | 0.78 | 0.66 | 0 | 3.1 |
| Spencer Howard | 0.11 | 0.44 | 0.65 | 0.88 | 0.97 | 0 | 3.1 |
| Ian Anderson | 0.90 | 0.86 | 0.55 | 0.39 | 0.35 | 0 | 3.0 |
| Aaron Nola | 0.89 | 0.99 | 0.33 | 0.22 | 0.63 | 0 | 3.0 |
| Jake Brentz | 0.69 | 0.66 | 0.84 | 0.47 | 0.38 | 0 | 3.0 |
| Enyel De Los Santos | 0.35 | 0.41 | 0.44 | 0.95 | 0.90 | 0 | 3.0 |
| Matt Barnes | 0.74 | 0.71 | 0.98 | 0.21 | 0.41 | 0 | 3.0 |
| Matt Andriese | 0.29 | 0.42 | 0.70 | 0.88 | 0.75 | 0 | 3.0 |
| Sean Manaea | 0.95 | 0.96 | 0.32 | 0.38 | 0.44 | 0 | 3.0 |
| Zac Gallen | 0.61 | 0.90 | 0.49 | 0.50 | 0.54 | 0 | 3.0 |
| J.B. Wendelken | 0.57 | 0.31 | 0.83 | 0.78 | 0.54 | 0 | 3.0 |
| Greg Holland | 0.41 | 0.45 | 0.92 | 0.57 | 0.68 | 0 | 3.0 |
| Alex Young | 0.35 | 0.36 | 0.44 | 0.96 | 0.92 | 0 | 3.0 |
| Matt Foster | 0.37 | 0.33 | 0.77 | 0.70 | 0.85 | 0 | 3.0 |
| Bryan Abreu | 0.47 | 0.27 | 0.74 | 0.73 | 0.82 | 0 | 3.0 |
| Kolby Allard | 0.49 | 0.81 | 0.46 | 0.48 | 0.78 | 0 | 3.0 |
| Junior Guerra | 0.64 | 0.53 | 0.05 | 0.94 | 0.86 | 0 | 3.0 |
| Heath Hembree | 0.29 | 0.71 | 0.93 | 0.30 | 0.79 | 0 | 3.0 |
| Archie Bradley | 0.80 | 0.32 | 0.82 | 0.68 | 0.39 | 0 | 3.0 |
| Luis Patino | 0.72 | 0.64 | 0.65 | 0.46 | 0.54 | 0 | 3.0 |
| Trevor Williams | 0.59 | 0.74 | 0.36 | 0.76 | 0.55 | 0 | 3.0 |
| Caleb Smith | 0.57 | 0.86 | 0.28 | 0.62 | 0.68 | 0 | 3.0 |
| David Peterson | 0.38 | 0.61 | 0.57 | 0.65 | 0.79 | 0 | 3.0 |
| Jake Diekman | 0.41 | 0.70 | 0.91 | 0.56 | 0.43 | 0 | 3.0 |
| Triston McKenzie | 0.70 | 0.89 | 0.42 | 0.29 | 0.70 | 0 | 3.0 |
| Jeff Hoffman | 0.48 | 0.68 | 0.38 | 0.85 | 0.61 | 0 | 3.0 |
| Lou Trivino | 0.81 | 0.59 | 0.97 | 0.41 | 0.22 | 0 | 3.0 |
| Sean Newcomb | 0.33 | 0.36 | 0.74 | 0.93 | 0.65 | 0 | 3.0 |
| James Kaprielian | 0.87 | 0.86 | 0.43 | 0.37 | 0.47 | 0 | 3.0 |
| Lucas Sims | 0.66 | 0.66 | 0.92 | 0.20 | 0.56 | 0 | 3.0 |
| Blake Snell | 0.80 | 0.95 | 0.21 | 0.53 | 0.51 | 0 | 3.0 |
| Brent Suter | 0.96 | 0.60 | 0.72 | 0.52 | 0.20 | 0 | 3.0 |
| Randy Dobnak | 0.25 | 0.15 | 0.79 | 0.82 | 0.98 | 0 | 3.0 |
| Jon Lester | 0.78 | 0.75 | 0.04 | 0.78 | 0.65 | 0 | 3.0 |
| Jose Urena | 0.55 | 0.59 | 0.14 | 0.87 | 0.83 | 0 | 3.0 |
| Vince Velasquez | 0.43 | 0.80 | 0.13 | 0.74 | 0.89 | 0 | 3.0 |
| Jacob Webb | 0.70 | 0.23 | 0.76 | 0.79 | 0.50 | 0 | 3.0 |
| Mike Minor | 0.85 | 0.92 | 0.10 | 0.39 | 0.72 | 0 | 3.0 |
| Cristian Javier | 0.60 | 0.88 | 0.84 | 0.29 | 0.34 | 0 | 3.0 |
| Paul Sewald | 0.92 | 0.81 | 0.94 | 0.10 | 0.19 | 0 | 3.0 |
| Chad Kuhl | 0.67 | 0.65 | 0.28 | 0.68 | 0.68 | 0 | 3.0 |
| Paolo Espino | 0.64 | 0.75 | 0.69 | 0.33 | 0.53 | 0 | 3.0 |
| David Price | 0.63 | 0.49 | 0.68 | 0.67 | 0.46 | 0 | 2.9 |
| Jordan Montgomery | 0.75 | 0.94 | 0.35 | 0.47 | 0.43 | 0 | 2.9 |
| Aaron Bummer | 0.68 | 0.65 | 0.84 | 0.44 | 0.33 | 0 | 2.9 |
| Jose Suarez | 0.87 | 0.72 | 0.54 | 0.39 | 0.40 | 0 | 2.9 |
| Paul Campbell | 0.39 | 0.14 | 0.63 | 0.86 | 0.91 | 0 | 2.9 |
| Bryse Wilson | 0.51 | 0.39 | 0.55 | 0.71 | 0.77 | 0 | 2.9 |
| Carlos Hernandez | 0.77 | 0.64 | 0.65 | 0.48 | 0.38 | 0 | 2.9 |
| Hector Neris | 0.56 | 0.78 | 0.95 | 0.26 | 0.37 | 0 | 2.9 |
| Stefan Crichton | 0.05 | 0.02 | 0.89 | 0.99 | 0.96 | 0 | 2.9 |
| Reid Detmers | 0.26 | 0.04 | 0.67 | 0.97 | 0.97 | 0 | 2.9 |
| Aaron Civale | 0.96 | 0.79 | 0.51 | 0.21 | 0.43 | 0 | 2.9 |
| Riley Smith | 0.23 | 0.27 | 0.78 | 0.77 | 0.86 | 0 | 2.9 |
NATotal Rankings for the players (Using 5x5 Scoring) can be found here. While it looks like many of the top players have low worth scores, it is because we haven’t applied a modifier for IP yet.
options(digits=2)
df_pitching_init2021 =
df_pitching_init2 %>%
group_by(Name) %>%
filter(Season == 2021) %>%
arrange(desc(Worth)) %>%
select(Name,Wins_share,SO_share,SV_share,WHIP_share,ERA_share,HLD_share,Worth)
df_pitching_init2021 %>%
filter (Worth>3.9) %>%
kbl() %>%
kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)| Name | Wins_share | SO_share | SV_share | WHIP_share | ERA_share | HLD_share | Worth |
|---|---|---|---|---|---|---|---|
| Daniel Bard | 0.93 | 0.85 | 0.97 | 0.87 | 0.75 | 0 | 4.4 |
| Joe Jimenez | 0.98 | 0.86 | 0.76 | 0.80 | 0.85 | 0 | 4.2 |
| Paul Fry | 0.83 | 0.87 | 0.82 | 0.80 | 0.87 | 0 | 4.2 |
| Rafael Dolis | 0.63 | 0.84 | 0.89 | 0.95 | 0.81 | 0 | 4.1 |
| Ben Bowden | 0.83 | 0.82 | 0.57 | 0.97 | 0.92 | 0 | 4.1 |
| Jose Alvarado | 0.96 | 0.85 | 0.89 | 0.87 | 0.51 | 0 | 4.1 |
| Tanner Rainey | 0.22 | 0.93 | 0.89 | 0.93 | 0.97 | 0 | 3.9 |
| Sean Newcomb | 0.62 | 0.93 | 0.79 | 0.93 | 0.65 | 0 | 3.9 |
| Ryan Hendrix | 1.00 | 0.73 | 0.51 | 0.82 | 0.85 | 0 | 3.9 |
NANot all variables can be used for predictive modeling.
df_pitching_init3 = df_pitching_init2
#Be careful about RS - Run Support and RS/9
Lag Share Variables to use for predictive modeling. The variables that we created for the Worth metric must also be removed. This will create the final dataset.
#Order the dataset by lag columns
df_pitching_init4 = arrange(df_pitching_init3, playerid,Season) #playerid is the Fangraph id assigned to each player
# Convert dataframe to data.table format
DT_pitcher2 = data.table(df_pitching_init4)
#designate columns to lag - just the new shares
cols1 = (c('Wins_share','SO_share','SV_share', 'ERA_share','WHIP_share','HLD_share','Worth'))
anscols = paste("lag", cols1, sep="_")
DT_pitcher2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year
df_pitching_final = as.data.frame(DT_pitcher2) %>%
select(-c(Wins_share,SO_share,SV_share, ERA_share,WHIP_share,HLD_share,Name))%>%
select(-FIP,-(RAR:WPA),-(wFB:wCH),-(`ERA-`:`xFIP-`),
-SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-(`wCH (pi)`:`wSL (pi)`),-(`K/9+`:`HR/FB%+`)) %>% select(-W,-SO,-SV,-HLD,-W_IP,-SO_IP,-SV_IP,-WHIP,-ERA,-HLD_IP)We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)
set.seed(15674) # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_pitching_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_pitching_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_pitching_final[-inTrain,]
nrow(tr_2021)/nrow(df_pitching_final) #check if split is 0.8[1] 0.8
Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found here.